63 research outputs found
Copula Based Hierarchical Bayesian Models
The main objective of our study is to employ copula methodology to develop Bayesian
hierarchical models to study the dependencies exhibited by temporal, spatial and
spatio-temporal processes. We develop hierarchical models for both discrete and
continuous outcomes. In doing so we expect to address the dearth of copula based
Bayesian hierarchical models to study hydro-meteorological events and other physical
processes yielding discrete responses.
First, we present Bayesian methods of analysis for longitudinal binary outcomes using
Generalized Linear Mixed models (GLMM). We allow flexible marginal association
among the repeated outcomes from different time-points. A unique property of this
copula-based GLMM is that when the conditional link function is integrated over the
distribution of the random effects, the resulting marginal link function has the same
form. This property lets us retain the physical interpretation of the fixed effects
under both the conditional and the marginal model, and it yields a proper posterior distribution.
We illustrate the performance of the posited model using real-life AIDS data
and demonstrate its superiority over the traditional Gaussian random effects model.
We develop a semiparametric extension of our GLMM and re-analyze the data from
the AIDS study.
Next, we propose a general class of models to handle non-Gaussian spatial data. The
proposed model accommodates geostatistical data that exhibit skewness,
tail-heaviness, and multimodality. We fix the distribution of the marginal processes and
induce dependence via copulas. We illustrate the superior predictive performance
of our approach in modeling precipitation data as compared to other kriging variants.
Thereafter, we employ mixture kernels as the copula function to accommodate
non-stationary data. We demonstrate the adequacy of this non-stationary model by
analyzing permeability data. In both cases we perform extensive simulation studies
to investigate the performances of the posited models under misspecification.
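The idea of fixing the marginal distribution and inducing spatial dependence through a copula can be sketched as follows. This is a minimal illustration, not the model from the study: the Gaussian copula, the exponential correlation function, the Exponential(rate) marginal, the one-dimensional site locations, and all function names are assumptions made for the example.

```python
import math
import random
from statistics import NormalDist


def exp_correlation(sites, range_param):
    """Correlation matrix exp(-|a - b| / range_param) for 1-D site locations (assumed form)."""
    return [[math.exp(-abs(a - b) / range_param) for b in sites] for a in sites]


def cholesky(matrix):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - s)
            else:
                lower[i][j] = (matrix[i][j] - s) / lower[j][j]
    return lower


def sample_copula_field(sites, range_param, rate, rng):
    """One draw with Exponential(rate) marginals and Gaussian-copula dependence."""
    std_normal = NormalDist()
    lower = cholesky(exp_correlation(sites, range_param))
    # Correlated standard normals: latent = L z.
    z = [rng.gauss(0.0, 1.0) for _ in sites]
    latent = [sum(lower[i][k] * z[k] for k in range(i + 1)) for i in range(len(sites))]
    # Map to uniforms with the normal CDF, then to the chosen marginal
    # with its inverse CDF. The clamp avoids log(0) in extreme draws.
    uniforms = [min(std_normal.cdf(v), 1.0 - 1e-12) for v in latent]
    return [-math.log(1.0 - u) / rate for u in uniforms]
```

Calling `sample_copula_field([0.0, 1.0, 2.5], 2.0, 0.5, random.Random(0))` returns one non-negative value per site. Swapping the last line for a different inverse CDF changes the marginal without touching the dependence structure, which is the separation the copula approach relies on.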
Finally, we take up the important problem of modeling multivariate extreme values
with copulas. We describe, in detail, how dependences can be induced in the
block maxima approach and peak over threshold approach by an extreme value copula.
We prove the ability of the posited model to handle both strong and weak extremal
dependence and derive the conditions for posterior propriety. We analyze the extreme
precipitation events in the continental United States for the past 98 years and
produce a suite of predictive maps.
Spatio-temporal models of infectious disease with high rates of asymptomatic transmission
The surprisingly mercurial Covid-19 pandemic has highlighted the need not only to accelerate research on infectious diseases, but also to study them using novel techniques and perspectives. A major contributor to the difficulty of containing the current pandemic is the highly asymptomatic nature of the disease. In this investigation, we develop a modeling framework to study the spatio-temporal evolution of diseases with high rates of asymptomatic transmission, and we apply this framework to a hypothetical country with mathematically tractable geography; namely, square counties uniformly organized into a rectangle. We first derive a model for the temporal dynamics of susceptible, infected, and recovered populations, which is applied at the county level. Next, we use likelihood-based parameter estimation to derive temporally varying disease transmission parameters at the state-wide level. While these two methods give us some spatial structure and show the effects of behavioral and policy changes, they miss the evolution of hot zones that have caused significant difficulties in resource allocation during the current pandemic. It is evident that the distribution of cases will not remain static and tied to population density, as with many other diseases, but will continuously evolve. We model this as a diffusive process where the diffusivity varies spatially with the population distribution and temporally with the current number of simulated asymptomatic cases. With this final addition coupled to the SIR model with temporally varying transmission parameters, we capture the evolution of hot zones in our hypothetical setup.
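A county-level SIR model coupled to diffusion on a rectangular grid of counties might look roughly like the sketch below. It is a simplified illustration under stated assumptions: forward-Euler time stepping, a single constant diffusivity (the abstract describes one that varies in space and time), no separate asymptomatic compartment, and hypothetical function names.

```python
def sir_step(s, i, r, beta, gamma, dt):
    """One forward-Euler step of the SIR equations for a single county."""
    n = s + i + r
    if n <= 0:
        return s, i, r
    new_infections = beta * s * i / n * dt
    new_recoveries = gamma * i * dt
    return s - new_infections, i + new_infections - new_recoveries, r + new_recoveries


def diffuse(grid, diffusivity, dt):
    """Exchange infected counts between adjacent counties; no flux leaves the rectangle."""
    rows, cols = len(grid), len(grid[0])
    out = [row[:] for row in grid]
    for y in range(rows):
        for x in range(cols):
            # Visit each right and down neighbour once so every edge is counted once.
            for dy, dx in ((0, 1), (1, 0)):
                yy, xx = y + dy, x + dx
                if yy < rows and xx < cols:
                    flux = diffusivity * (grid[y][x] - grid[yy][xx]) * dt
                    out[y][x] -= flux
                    out[yy][xx] += flux
    return out


def simulate(s, i, r, beta, gamma, diffusivity, dt, steps):
    """Alternate a local SIR update with a diffusion step; inputs are not modified."""
    s = [row[:] for row in s]
    i = [row[:] for row in i]
    r = [row[:] for row in r]
    for _ in range(steps):
        for y in range(len(s)):
            for x in range(len(s[0])):
                s[y][x], i[y][x], r[y][x] = sir_step(
                    s[y][x], i[y][x], r[y][x], beta, gamma, dt
                )
        i = diffuse(i, diffusivity, dt)
    return s, i, r
```

Both updates conserve the total population, so seeding infections in one county and running `simulate` shows cases spreading into neighbouring counties over time. Making `diffusivity` a per-edge, per-step quantity would be the natural place to add the population and asymptomatic-case dependence described above.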
Design of Probabilistic Random Forests with Applications to Anticancer Drug Sensitivity Prediction (2016)
Random forests consisting of an ensemble of regression trees with equal weights are frequently used to design predictive models. In this article, we consider an extension of the methodology by representing the regression trees in the form of probabilistic trees and analyzing the nature of heteroscedasticity. The probabilistic tree representation allows for analytical computation of confidence intervals (CIs), and the tree weight optimization is expected to provide tighter CIs with comparable performance in mean error. We approached the prediction of the ensemble of probabilistic trees from two perspectives: as a mixture distribution and as a weighted sum of correlated random variables. We applied our methodology to the drug sensitivity prediction problem on synthetic data and the Cancer Cell Line Encyclopedia (CCLE) dataset and illustrated that tree weights can be selected to reduce the average length of the CI without an increase in mean error.
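The mixture-distribution view of the ensemble can be sketched as follows. This is a minimal sketch, assuming each tree's prediction at a query point is Gaussian with its own mean and standard deviation; that Gaussian leaf model, the bisection search for quantiles, and the function names are illustrative assumptions rather than the article's construction.

```python
from statistics import NormalDist


def _normalize(weights):
    """Scale tree weights so they sum to one."""
    total = sum(weights)
    return [w / total for w in weights]


def mixture_mean_var(means, sds, weights):
    """Mean and variance of a weighted mixture of per-tree Gaussian predictions."""
    w = _normalize(weights)
    mean = sum(wi * m for wi, m in zip(w, means))
    second_moment = sum(wi * (s * s + m * m) for wi, m, s in zip(w, means, sds))
    return mean, second_moment - mean * mean


def mixture_cdf(x, means, sds, weights):
    """CDF of the mixture: the weighted sum of the component CDFs."""
    w = _normalize(weights)
    return sum(wi * NormalDist(m, s).cdf(x) for wi, m, s in zip(w, means, sds))


def mixture_quantile(p, means, sds, weights, tol=1e-9):
    """Quantile of the mixture, found by bisection on its monotone CDF."""
    lo = min(m - 10.0 * s for m, s in zip(means, sds))
    hi = max(m + 10.0 * s for m, s in zip(means, sds))
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mixture_cdf(mid, means, sds, weights) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def mixture_interval(means, sds, weights, level=0.95):
    """Central interval of the mixture at the given coverage level."""
    tail = (1.0 - level) / 2.0
    return (
        mixture_quantile(tail, means, sds, weights),
        mixture_quantile(1.0 - tail, means, sds, weights),
    )
```

Because the interval length depends on the weights, comparing `mixture_interval` under equal weights and under candidate weights gives a direct way to see whether a reweighting shortens the interval while `mixture_mean_var` tracks the effect on the point prediction.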
S1: Supplementary Information for Article: A copula based approach for design of multivariate random forests for drug sensitivity prediction
Changes in performance with prior feature selection
Random forest (RF) is designed to create uncorrelated trees using random subsets of features in each node of each tree. RF by itself is a useful tool for feature selection from a high-dimensional set of features. However, we observed that prediction accuracy improves when a prior feature selection approach (RELIEFF) [1] is applied. Table A shows the performance of RF, VMRF and CMRF with and without RELIEFF feature selection on two drug sets from GDSC.
Performance Analysis for drug sets consisting of more than two drugs
We have generated empirical copulas for the bivariate cases, as they are able to capture all forms of dependency structures. However, generating empirical copulas has high computational complexity and requires a significant number of training samples at each node. Thus, for more than two drug responses, we have considered parametric copulas: instead of the integral difference between empirical copulas, we use the difference between the Gaussian copula parameters estimated from the root node samples and from the split node samples. To test our hypothesis that VMRF and CMRF will perform better than RF, we considered a drug set with 4 different drugs from CCLE with a single common target between them and a drug set with 3 different drugs in GDSC with a common target between them. The CCLE set has 482 cell lines and the GDSC set has 308 cell lines. RELIEFF was used to reduce the feature space prior to random forest application. For simplicity, in this case, we've used 30% of the sample cell lines as training data and 70% of them as testing data.
A Copula Based Approach for Design of Multivariate Random Forests for Drug Sensitivity Prediction
Modeling sensitivity to drugs based on genetic characterizations is a significant challenge in the area of systems medicine. Ensemble-based approaches such as Random Forests have been shown to perform well in both individual sensitivity prediction studies and team-science-based prediction challenges. However, Random Forests generate a deterministic predictive model for each drug based on the genetic characterization of the cell lines and ignore the relationship between different drug sensitivities during model generation. This application motivates the need for multivariate ensemble learning techniques that can increase prediction accuracy and improve variable importance ranking by incorporating the relationships between different output responses. In this article, we propose a novel cost criterion that captures the dissimilarity in the output response structure between the training data and node samples as the difference between the two empirical copulas. We illustrate that copulas are suitable for capturing the multivariate structure of output responses independent of the marginal distributions, and that the copula-based multivariate random forest framework can provide higher prediction accuracy and improved variable selection. The proposed framework has been validated on the Genomics of Drug Sensitivity in Cancer (GDSC) and Cancer Cell Line Encyclopedia (CCLE) databases.
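A dissimilarity between two bivariate empirical copulas, of the kind the cost criterion compares between training data and node samples, can be sketched as below. This is an illustration only: the rank-based pseudo-observations, the evaluation grid, and the use of a mean absolute difference over that grid as a stand-in for an integral difference are assumptions, and the function names are hypothetical.

```python
def pseudo_obs(values):
    """Ranks scaled to (0, 1]; ties are broken by order of appearance."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank / len(values)
    return ranks


def empirical_copula(x, y, grid):
    """Empirical copula C(u, v) of paired responses, evaluated on grid x grid."""
    u, v = pseudo_obs(x), pseudo_obs(y)
    n = len(x)
    return [
        [sum(1 for a, b in zip(u, v) if a <= gu and b <= gv) / n for gv in grid]
        for gu in grid
    ]


def copula_distance(x1, y1, x2, y2, grid_size=10):
    """Mean absolute difference between two empirical copulas over a regular grid."""
    grid = [(k + 1) / grid_size for k in range(grid_size)]
    c1 = empirical_copula(x1, y1, grid)
    c2 = empirical_copula(x2, y2, grid)
    diff = sum(abs(a - b) for row1, row2 in zip(c1, c2) for a, b in zip(row1, row2))
    return diff / grid_size**2
```

Because the copula is built from ranks, applying any strictly increasing transformation to either response leaves the distance unchanged, which reflects the independence from the marginal distributions noted above. Two samples with opposite dependence (one increasing together, one moving in opposite directions) give a clearly positive distance.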